Sentiment Analysis

  1. Read Ch 2 on sentiment analysis, as well as the sentiment analysis section of Ch 9, to understand how to approach lexicon-based sentiment analysis using tidytext.

  2. Using the Twitter data from your previous assignment with Biden and Trump tweets, perform sentiment analysis using the AFINN lexicon. The tweets dataset has been attached.

  3. Compare sentiment between the candidates with a full analysis and visualizations. Show the most positive/negative words for each candidate. You can use bar plots and/or comparison.cloud() to group words by positive/negative sentiment. See Ch 2 of your reading and the sample notebook; the sample notebook has specific instructions for transformations.

  4. Extra Credit (+5 points): If you do this, add an extra credit section as a header. Perform sentiment analysis using the Bing lexicon.

Load Libraries and Data

# twitter library 
library(rtweet)

# plotting and pipes
library(tidyverse)
library(ggplot2)
library(dplyr)
library(stringr)
library(tidyr)

# text mining library
library(tm)
library(tidytext)
library(wordcloud)
library(reshape2)
library(textstem)

# date/time library
library(lubridate)

# Read in the tweet dataset
df <- read.csv("candidates.csv")
# Convert the date field to a datetime
df$created_at <- as_datetime(df$created_at)

# Change some fields to factors for easier manipulation later
df$user_id <- as.factor(df$user_id)
df$status_id <- as.factor(df$status_id)
df$screen_name <- as.factor(df$screen_name)

# Fix up the reply count field: it should be an integer with NAs set to 0
df$reply_count[is.na(df$reply_count)] <- 0
df$reply_count <- as.integer(df$reply_count)
head(df)
tail(df)

Filter out Retweets

Validate the number of rows before and after removing retweets.

nrow(df)
## [1] 6377
df <- df %>%
  filter(is_retweet == FALSE)
head(df)

Show the total number of rows in this dataset.

nrow(df)
## [1] 4765

Add a doc_id Incrementing per Row

df <- df %>%
  mutate(doc_id = paste0("doc", row_number())) %>%
  select(doc_id, everything())
head(df)

Add a Column for Text Length

df$text_len <- str_count(df$text)

EDA

Simplify the Data Frame to just the target fields we’ll be using throughout the analysis.

df_select <- df %>%
  select(doc_id, user_id, status_id, screen_name, created_at, text, text_len)

Tweet Length

Summary statistics for tweet length (in characters):

summary(df_select$text_len)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     7.0    94.0   184.0   171.5   256.0   316.0

Tweets in the dataset range from 7 characters to 316, with a mean of about 172.

df_select %>%
  ggplot(aes(x = text_len, fill = screen_name)) +
  geom_histogram(alpha = .5, color = "darkgray", bins = 33) +
  theme_minimal()

Trump's tweets are a little more right-skewed than Biden's, which appear to be slightly left-skewed (they tend to be longer). Both candidates have a very high number of short tweets (~20 characters).
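
A quick per-candidate summary puts numbers behind the skew observation (a small sketch using the df_select frame from above):

# Per-candidate tweet-length summary to quantify the skew noted above
df_select %>%
  group_by(screen_name) %>%
  summarise(min = min(text_len),
            median = median(text_len),
            mean = mean(text_len),
            max = max(text_len),
            .groups = "drop")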

Total Number of Tweets

df_select %>%
  group_by(screen_name) %>%
  summarise(total = n(), .groups = "keep") %>%
  ggplot(aes(x = screen_name, y = total, fill = screen_name)) +
  geom_col() + 
  theme_minimal() +
  geom_text(aes(label = total), position = position_stack(vjust = 0.5))

After removing retweets, Trump's total dropped to only 1,733, while Biden's fell much less, to 3,032. Trump retweets more often than Biden.
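
To double-check the retweet gap, we can re-read the raw file and count retweets per candidate (a sketch; it assumes candidates.csv is still in the working directory, since df itself has already been filtered):

# Original tweets vs. retweets per candidate, from the raw file
read.csv("candidates.csv") %>%
  count(screen_name, is_retweet)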

Tweets Over Time

df_select %>%
  dplyr::group_by(screen_name) %>%
  ts_plot("days", trim = 0L) +
  ggplot2::geom_line() +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.title = ggplot2::element_blank(),
    legend.position = "right",
    plot.title = ggplot2::element_text(face = "bold")) +
  ggplot2::labs(
    x = NULL, y = NULL,
    title = "Frequency of Tweets by Candidate by Day"
  )

Because of how this dataset was collected (the last 3,200 tweets per candidate), Trump's data is compressed into dates starting around August: he tweets more regularly than Biden and so hits the 3,200-tweet limit sooner.
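
A quick look at each candidate's date range confirms this (sketch):

# Earliest and latest tweet per candidate
df_select %>%
  group_by(screen_name) %>%
  summarise(first_tweet = min(created_at),
            last_tweet = max(created_at),
            .groups = "drop")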

Clean and Prepare the Text for Analysis

Remove useless text such as full and shortened URLs; these appear frequently in tweets and skew results.

Clean Twitter-Specific Items

https://stackoverflow.com/questions/31348453/how-do-i-clean-twitter-data-in-r

df_select$text = gsub("[ \t]{2,}", " ", df_select$text)
df_select$text = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", df_select$text)
df_select$text = gsub("@\\w+", " ", df_select$text)
df_select$text = gsub("[[:digit:]]", " ", df_select$text)
df_select$text = gsub("http\\w+", " ", df_select$text)
df_select$text = gsub("^\\s+|\\s+$", " ", df_select$text)

Remove Whitespace and Lemmatize

# Collapse extra whitespace introduced by the substitutions above
df_select$text = stripWhitespace(df_select$text)
# Reduce words to their lemmas
df_select$text = lemmatize_strings(df_select$text)
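
As a quick sanity check of what lemmatization does (the exact output depends on textstem's default lemma dictionary):

# Inflected forms such as "running" and "voted" should collapse to "run" and "vote"
lemmatize_strings("running runs ran votes voted")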

unnest_tokens()

Create a new column with each word on its own row.

tidy_df <- df_select %>%
  unnest_tokens(word, text)

Stop Word Removal

Remove standard stop words and custom stop words from the results.

Note: Removing "trump" because it's a proper name but also a verb, which is marked as positive and appeared in the top results for both candidates.

custom_stop_words <- bind_rows(tibble(word = c("trump"), lexicon = c("custom")), stop_words)

tidy_df <- tidy_df %>%
  anti_join(custom_stop_words, by = "word")
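
A quick check that the custom stop word was actually removed (should return 0):

# Count remaining rows containing the custom stop word "trump"
tidy_df %>%
  filter(word == "trump") %>%
  nrow()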

Validate the New Number of Rows

Dramatically larger now that each word from the text is on its own row.

nrow(tidy_df)
## [1] 54903

After unnesting the words, each word of the tweet is on a separate line. The following is an example.

tidy_df %>%
  filter(status_id == "x1319032499456987136") %>%
  head(n=10)

Sentiment Analysis

Bing Sentiment Lexicon

Using the Bing lexicon from Bing Liu and collaborators adds a sentiment column, marking each word as positive or negative.

https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

bing_df <- tidy_df %>%
  inner_join(get_sentiments("bing"), by = "word")
bing_df %>%
  group_by(screen_name, sentiment) %>%
  summarise(count = n(), .groups = "keep")

AFINN scoring Lexicon

AFINN, from Finn Årup Nielsen, adds a value column: a numeric representation of how positive or negative each word is. The AFINN lexicon scores sentiment on a scale from -5 to 5.

http://www2.imm.dtu.dk/pubdb/pubs/6010-full.html

afinn_df <- tidy_df %>%
  inner_join(get_sentiments("afinn"), by = "word")

head(afinn_df)
afinn_df %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 10, fill = "#F8766D", show.legend = FALSE) +
  scale_x_continuous(breaks = c(-5, -3, -1, 1, 3, 5)) +
  theme_minimal()

For the dataset overall, the distribution is slightly left-skewed, showing a greater concentration of words with positive values. Very few words fall at the extremes (-5, -4, 4, 5).
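
The exact counts behind the histogram back this up (sketch):

# Tabulate AFINN scores to show the concentration in the mild positive range
afinn_df %>%
  count(value)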

NRC Sentiment Lexicon

NRC from Saif Mohammad and Peter Turney. The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions as well as positive and negative sentiment.

One thing to note: a single word can be associated with multiple emotions.

nrc_df <- tidy_df %>%
  inner_join(get_sentiments("nrc"), by = "word")

Total counts for all 8 emotions and 2 sentiments.

nrc_df %>%
  group_by(sentiment) %>%
  summarise(total = n(), .groups = "keep") %>%
  arrange(desc(total))
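
To see the multi-label behavior noted above, we can count how many NRC categories each word carries (a sketch against the lexicon itself, not the tweet data):

# Words in the NRC lexicon associated with more than one emotion/sentiment
get_sentiments("nrc") %>%
  count(word, sort = TRUE) %>%
  filter(n > 1) %>%
  head(n = 10)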

Inspect Top Words Per Candidate

Using various methods, inspect which words are most frequently used by each candidate, the proportions of negative and positive words, and the trend over time.

Top Word Counts (BING)

bing_df %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(15) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  theme_minimal() +
  labs(x = "Contribution to sentiment",
       y = NULL)

Looking at the data as a whole, the top negative word is crisis and the top positive word is win. Next, we'll split these out by candidate.

Top Word Count Per Candidate (BING)

bing_df %>%
  count(screen_name, word, sentiment, sort = TRUE) %>%
  group_by(screen_name, sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(screen_name, sentiment), scales = "free_y") +
  theme_minimal() +
  labs(x = "Contribution to sentiment",y = NULL)

Negative: Biden has the top word crisis in his list, where Trump doesn't. Biden uses threat, hate, fear, and lie in his negative list, while Trump has fake, crime, radical, corrupt, and crazy.

Positive: Both candidates show win as the top word, with support, strong, honor, and protect in their top 10. Biden also uses promise, safe, and love; Trump uses endorsement, fast, congratulations, incredible, and happy.

Overall Top Words (BING)

bing_df %>%
  count(screen_name, word, sentiment, sort = TRUE) %>%
  group_by(screen_name) %>%
  top_n(30) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = TRUE) +
  facet_wrap(~screen_name, scales = "free_y") +
  theme_minimal() +
  labs(x = "Contribution to sentiment", y = NULL)

Looking at the mix of top-30 words and whether they're positive or negative, both candidates are about the same: a mix of positive and negative words throughout. Biden tends to have a few more positive words clustered at the top, where Trump is more mixed.

Top Words Sorted by AFINN Score

afinn_df$color <- ifelse(afinn_df$value < 0, "Negative","Positive")

afinn_df %>%
  count(screen_name, word, value, color, sort = TRUE) %>%
  group_by(screen_name) %>%
  top_n(30) %>%
  ungroup() %>%
  mutate(word = reorder(word, value)) %>%
  ggplot(aes(value, word, fill = color)) +
  geom_col(show.legend = TRUE) +
  facet_wrap(~screen_name, scales = "free_y") +
  theme_minimal() +
  labs(x = "AFINN Sentiment Score", y = NULL)

The top 30 words per candidate, sorted by their AFINN score on the -5 to 5 scale. Biden used slightly more positive words in his top 30; both seem generally balanced when sorted this way.

Overall Sentiment per Candidate (AFINN)

afinn_df %>%
  count(screen_name, word, value, sort = TRUE) %>%
  group_by(screen_name) %>%
  summarize(avg = mean(value * n), .groups = "keep") %>%
  mutate(screen_name = reorder(screen_name, avg)) %>%
  ggplot(aes(avg, screen_name, fill = screen_name, label = avg)) +
  geom_col() +
  labs(x = "Average sentiment Score", y = NULL) +
  theme_minimal() +
  geom_text(position = position_dodge(width = 0.9), vjust = -0.5) +
  coord_flip()

The mean sentiment score is calculated from the AFINN lexicon. As a reminder, this scale runs from -5 to +5. The calculation multiplies each distinct word's AFINN score by the number of times that word occurs, then takes the mean across distinct words. We see that Biden's is approximately +1 while Trump's is just over 0.
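
Note that averaging value * n across distinct words is not the same as a frequency-weighted mean. A cross-check that averages the AFINN score over every word occurrence directly (sketch):

# Frequency-weighted mean: each occurrence of a word counts once
afinn_df %>%
  group_by(screen_name) %>%
  summarise(weighted_avg = mean(value), .groups = "drop")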

Sentiment of Tweets over Time (AFINN)

plot_df2 <- afinn_df %>%
  filter(created_at > "2020-08-01") %>%
  mutate(mon = floor_date(created_at, "day")) %>%
  group_by(screen_name, mon) %>%
  summarize(value = mean(value), .groups = 'keep')

plot_df2$color <- ifelse(plot_df2$value < 0, "negative","positive")

ggplot(plot_df2, aes(mon, value, fill = color)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~screen_name, ncol = 1, scales = "free_x") +
  labs(x = NULL, y = "Frequency of Sentiment") +
  theme_minimal()

This shows the mean daily sentiment for each candidate since August. Trump's earlier timeline tended to be more negative, while both candidates turned more positive starting in mid-September.
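
An alternative view that can make the trend easier to read is to overlay a loess smoother on the daily means (a sketch using ggplot2's default smoothing span):

# Smoothed daily mean sentiment per candidate
ggplot(plot_df2, aes(mon, value, color = screen_name)) +
  geom_line(alpha = 0.4) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(x = NULL, y = "Mean AFINN score per day") +
  theme_minimal()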

Find most positive and negative individual messages by user (AFINN)

Most Positive Biden Messages

afinn_df %>%
  filter(screen_name == "JoeBiden") %>%
  group_by(doc_id) %>%
  summarize(total_value = sum(value), word_count = n(), .groups = "keep") %>%
  arrange(desc(total_value))
df %>%
  filter(doc_id == "doc2737" | doc_id == "doc2894") %>%
  select(text)

Most Negative Biden Messages

afinn_df %>%
  filter(screen_name == "JoeBiden") %>%
  group_by(doc_id) %>%
  summarize(total_value = sum(value), word_count = n(), .groups = "keep") %>%
  arrange(total_value)
df %>%
  filter(doc_id == "doc2504" | doc_id == "doc623") %>%
  select(text)

Most Positive Trump Messages

afinn_df %>%
  filter(screen_name == "realDonaldTrump") %>%
  group_by(doc_id) %>%
  summarize(total_value = sum(value), word_count = n(), .groups = "keep") %>%
  arrange(desc(total_value))
df %>%
  filter(doc_id == "doc3173" | doc_id == "doc3708") %>%
  select(text)

Most Negative Trump Messages

afinn_df %>%
  filter(screen_name == "realDonaldTrump") %>%
  group_by(doc_id) %>%
  summarize(total_value = sum(value), word_count = n(), .groups = "keep") %>%
  arrange(total_value)
df %>%
  filter(doc_id == "doc3190" | doc_id == "doc4219") %>%
  select(text)
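
The four blocks above repeat the same pattern, so a small helper could parameterize the candidate and sort direction (a sketch; top_docs is a hypothetical name, not from any package):

# Hypothetical helper: total AFINN score per document for one candidate
top_docs <- function(data, who, most_positive = TRUE) {
  out <- data %>%
    filter(screen_name == who) %>%
    group_by(doc_id) %>%
    summarize(total_value = sum(value), word_count = n(), .groups = "drop")
  if (most_positive) arrange(out, desc(total_value)) else arrange(out, total_value)
}

top_docs(afinn_df, "JoeBiden")                                # most positive Biden
top_docs(afinn_df, "realDonaldTrump", most_positive = FALSE)  # most negative Trump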

Word Clouds

Overall Word Cloud

bing_df %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 150))

Word Clouds for Each Candidate

Biden

bing_df %>%
  filter(screen_name == "JoeBiden") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 150))

Trump

bing_df %>%
  filter(screen_name == "realDonaldTrump") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 150))

Comparison Cloud

comparison.cloud() groups words by positive/negative sentiment across both candidates.

bing_df %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"), max.words = 200)

Checking for Top-Negation Word Pairs

The method used here doesn't take into account negation words like no and never preceding other words. We can quickly check the top negation word pairs to see how they occur in our text.

bigrams <- df_select %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigram_counts <- bigrams %>%
  count(bigram, sort = TRUE) %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigram_counts
negate_words <- c("not", "without", "no", "can't", "don't", "won't")

bigram_counts %>%
  filter(word1 %in% negate_words) %>%
  count(word1, word2, wt = n, sort = TRUE) %>%
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  mutate(contribution = value * n) %>%
  group_by(word1) %>%
  slice_max(abs(contribution), n = 8) %>%
  ungroup() %>%
  mutate(word2 = reorder_within(word2, contribution, word1)) %>%
  ggplot(aes(contribution, word2, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ word1, scales = "free", nrow = 3) +
  scale_y_reordered() +
  labs(x = "Sentiment value * # of occurrences",
       y = "Words preceded by a negation")

Dataset Summary

For reference, the final per-candidate tweet counts and tweet-length distribution:

df_select %>%
  select(screen_name, text_len) %>%
  summary()
##           screen_name      text_len    
##  JoeBiden       :3032   Min.   :  7.0  
##  realDonaldTrump:1733   1st Qu.: 94.0  
##                         Median :184.0  
##                         Mean   :171.5  
##                         3rd Qu.:256.0  
##                         Max.   :316.0